{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-colab"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/pathwaycom/pathway/blob/main/examples/notebooks/showcases/fuzzy_join_part1.ipynb\" target=\"_parent\"><img src=\"https://pathway.com/assets/colab-badge.svg\" alt=\"Run In Colab\" class=\"inline\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "notebook-instructions",
      "source": [
        "# Installing Pathway with Python 3.10+\n",
        "\n",
        "In the cell below, we install Pathway into a Python 3.10+ Linux runtime.\n",
        "\n",
        "> **If you are running in Google Colab, please run the colab notebook (Ctrl+F9)**, disregarding the 'not authored by Google' warning.\n",
        "> \n",
        "> **The installation and loading time is less than 1 minute**.\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "id": "pip-installation-pathway",
      "source": [
        "%%capture --no-display\n",
        "!pip install --prefer-binary pathway"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "id": "logging",
      "source": [
        "import logging\n",
        "\n",
        "logging.basicConfig(level=logging.CRITICAL)"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "id": "2",
      "metadata": {
        "jp-MarkdownHeadingCollapsed": true
      },
      "source": [
        "# Automating reconciliation of messy financial transaction logs using Pathway's real-time fuzzy join\n",
        "\n",
        "## Fuzzy joins: 'errare humanum est'\n",
        "\n",
        "As the ancient maxim says, ['errare humanum est'](https://en.wiktionary.org/w/index.php?title=errare_humanum_est): to err is human.\n",
        "More than two thousands years later, this lesson is still very accurate in our modern world.\n",
        "Everyone makes mistakes and writing does not escape this fate: the longer the text the more mistakes there will be.\n",
        "However, most mistakes we usually make are small and do not hinder understanding.\n",
        "\n",
        "Unfortunately, computers, just like accountants, don't like mistakes. Computers cannot cope with mistakes. No matter how small the mistake, the computer will just reject the whole answer and throw an error.\n",
        "You have written your 10-digit password but finished with a lower case 'a' instead of a capital 'A'? The passwords obviously do not match, and you shall enter your password again!\n",
        "\n",
        "While this zero tolerance policy may make sense for security processes, it can be terrible when users have to enter long texts.\n",
        "For example, accountants may have to enter long logs of transactions by hand, creating many opportunities for mistakes.\n",
        "If those logs have to be compared to other logs (e.g. a log automatically generated by a pay station) then mismatches would appear: 'mr' instead of 'Mr'.\n",
        "Mistakes can also come from the way the data has been collected: using nicknames instead of full names, different email addresses etc.\n",
        "While humans could be able to match those logs despite the mistakes, computers cannot.\n",
        "\n",
        "Does it mean the computer is helpless in those cases, shifting all the tedious work of matching similar but different entries to human?\n",
        "Fortunately not, several mechanisms exist to assist or even perform the matching, and **fuzzy join** is one of them: a fuzzy join is a process which automatically matches entries from different logs despite not having a perfect matching between their keys."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3",
      "metadata": {},
      "source": [
        "## Fuzzy join in Pathway\n",
        "\n",
        "Fuzzy join is used to perform a join on datasets when the keys do not match exactly.\n",
        "Simple use cases include matching lower case strings with camelCase strings or matching\n",
        "floats with some precision threshold.\n",
        "\n",
        "Pathway's standard library comes with a powerful `smart_fuzzy_join` functionality.\n",
        "This tutorial is a showcase of its capabilities. We will develop a Data Application which allows for fuzzy-joining\n",
        "two streams of data against each other, and also for maintaining audit entries and updating results on the fly. Here is a sneak preview:\n",
        "\n",
        "![Demo animation](https://pathway.com/assets/content/showcases/fuzzy_join/demo.gif)\n",
        "\n",
        "## The data\n",
        "\n",
        "We will be doing the fuzzy-join between two datasets on money transfers\u2019 banking logs.\n",
        "When doing banking or bookkeeping, this operation would be known as [reconciliation](https://en.wikipedia.org/w/index.php?title=Reconciliation_(accounting)&oldid=1100237463) of\n",
        "two sets of transactions records.\n",
        "One dataset comes in a perfectly organized format - csv, the other dataset consists of\n",
        "'human written' lines describing the transactions.\n",
        "\n",
        "\n",
        "Here are samples from the datasets:\n",
        "\n",
        " **Data sourced automatically from a bank feed, in 'standard' CSV format**\n",
        "\n",
        "|id    |date      |amount|recipient |sender        |recipient_acc_no            |sender_acc_no               |\n",
        "|------|----------|------|----------|--------------|----------------------------|----------------------------|\n",
        "|0     |2020-06-04|8946  |M. Perez  |Jessie Roberts|HU30186000000000000008280573|ES2314520000000006226902    |\n",
        "|1     |2014-08-06|8529  |C. Barnard|Mario Miller  |ES8300590000000002968016    |PL59879710390000000009681693|\n",
        "|2     |2017-01-22|5048  |S. Card   |James Paletta |PL65889200090000000009197250|PL46193013890000000009427616|\n",
        "|3     |2020-09-15|7541  |C. Baxter |Hector Haley  |PL40881800090000000005784046|DE84733500000003419377      |\n",
        "|4     |2019-05-25|3580  |L. Prouse |Ronald Adams  |PL44124061590000000008986827|SI54028570008259759         |\n",
        "\n",
        "\n",
        "The first dataset is sourced automatically from a bank feed. Every few seconds a new batch of transactions is saved to `transactions/formatA/batch_timestamp.csv`.\n",
        "\n",
        " **Transaction logs entered by hand**\n",
        "\n",
        "|id |description|\n",
        "|---|-----------|\n",
        "|0  |Received 8521 \u20ac  on 2014-08-07 by INTERNATIONAL interest payment from ??? to C. Barnard, recipient acc. no. 000002968016 by BANCO DE MADRID, amount EUR \u20ac, flat fee 8 \u20ac |\n",
        "|1  |EUR 8944 on 2020-06-06 by INTERNATIONAL transfer credited to 00000000008280573 (M. Perez) by BNP Paribas Securities Services,  fee EUR 2, amount EUR 8946. |\n",
        "|2  |Finally got 5M quid on 2017-01-23 by DOMESTIC payment from Sergio Marquina to Bella Ciao, r. acc. 0000000009197250, oryg. amount 5_000_048, fees 5 quid. |\n",
        "|3  |3578 EUR am 2019-05-25 von INTERNATIONAL dividend payment by Pathway Inc. an L. Prouse, Empf\u00e4ngerkonto 8986827, Betrag 3580 EUR |\n",
        "|4  |Received 7540 EUR on 2020-09-15. Invoice, recipient C. Baxter, 0000000005784046, amount EUR 7541, fees EUR 1 |\n",
        "\n",
        "\n",
        "As you can see, it seems that each entry in the first dataset (data sourced automatically) has a corresponding entry in the other dataset (transaction logs entered by hand).\n",
        "In this example we will use the `smart_fuzzy_join` function from Pathway's standard library to make sure all is correctly matched.\n",
        "\n",
        "## What are we going to obtain?\n",
        "We want to obtain a table in which the matchings are expressed, e.g. the entry 0 for the first table corresponds to the entry 1 in the second table.\n",
        "In addition, we will include the confidence, a number expressing how confident we are in the matching.\n",
        "\n",
        "## Code\n",
        "First things first - imports:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4",
      "metadata": {
        "lines_to_next_cell": 2
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "import pathway as pw\n",
        "\n",
        "# To use advanced features with Pathway Scale, get your free license key from\n",
        "# https://pathway.com/features and paste it below.\n",
        "# To use Pathway Community, comment out the line below.\n",
        "pw.set_license_key(\"demo-license-key-with-telemetry\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5",
      "metadata": {},
      "source": [
        "And now, here come a few lines of code that read two datasets, try to match rows, and report matchings by writing to a csv file.\n",
        "\n",
        "\n",
        "The data is read from csv files.\n",
        "For the purpose of this demonstration we will simply print a table with matchings found on the data sample presented above.\n",
        "But the code below works also in a production environment. In production:\n",
        "- All csv files will be dynamically ingested from these directories in their order of appearance.\n",
        "- The output will be updated immediately as new data appears at input."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6",
      "metadata": {},
      "outputs": [],
      "source": [
        "# Uncomment to download the required files.\n",
        "# %%capture --no-display\n",
        "# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_1_transactionsA.csv -O transactionsA.csv\n",
        "# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_1_transactionsB.csv -O transactionsB.csv"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7",
      "metadata": {
        "lines_to_next_cell": 2
      },
      "source": [
        "We use [our csv connectors](/developers/user-guide/connect/connectors/csv_connectors/) to read the csv files:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8",
      "metadata": {
        "lines_to_next_cell": 2
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | recipient_acc_no             | date       | amount | recipient  | sender         | sender_acc_no\n",
            "^GNS8X6R... | ES8300590000000002968016     | 2014-08-06 | 8529   | C. Barnard | Mario Miller   | PL59879710390000000009681693\n",
            "^KSJ0J11... | HU30186000000000000008280573 | 2020-06-04 | 8946   | M. Perez   | Jessie Roberts | ES2314520000000006226902\n",
            "^XG26N77... | PL40881800090000000005784046 | 2020-09-15 | 7541   | C. Baxter  | Hector Haley   | DE84733500000003419377\n",
            "^HXTV5C4... | PL44124061590000000008986827 | 2019-05-25 | 3580   | L. Prouse  | Ronald Adams   | SI54028570008259759\n",
            "^KTYMV6X... | PL65889200090000000009197250 | 2017-01-22 | 5048   | S. Card    | James Paletta  | PL46193013890000000009427616\n",
            "            | description\n",
            "^66ZMGEV... | Received EUR 3578 on 2019-05-25 by INTERNATIONAL dividend from R. Adams to L. Prouse, recipient account 0000000008986827 by None, amount EUR 3580, fees EUR 2\n",
            "^3Q58BEG... | Received EUR 5043 on 2017-01-23 by DOMESTIC payment from J. Paletta to S. Card, recipient account 0000000009197250 by None, amount EUR 5048, fees EUR 5\n",
            "^D15GKVR... | Received EUR 7540 on 2020-09-15 by INTERNATIONAL invoice from H. Haley to C. Baxter, recipient account 0000000005784046 by None, amount EUR 7541, fees EUR 1\n",
            "^Y9ZHJG0... | Received EUR 8521 on 2014-08-07 by INTERNATIONAL interest from M. Miller to C. Barnard, recipient account 000002968016 by BANCO DE MADRID, amount EUR 8529, fees EUR 8\n",
            "^B85GHJN... | Received EUR 8944 on 2020-06-06 by INTERNATIONAL transaction from J. Roberts to M. Perez, recipient account 00000000008280573 by BNP Paribas Securities Services, amount EUR 8946, fees EUR 2\n"
          ]
        }
      ],
      "source": [
        "class TransactionsA(pw.Schema):\n",
        "    recipient_acc_no: str = pw.column_definition(primary_key=True)\n",
        "    date: str\n",
        "    amount: str\n",
        "    recipient: str\n",
        "    sender: str\n",
        "    sender: str\n",
        "    sender_acc_no: str\n",
        "\n",
        "\n",
        "class TransactionsB(pw.Schema):\n",
        "    description: str = pw.column_definition(primary_key=True)\n",
        "\n",
        "\n",
        "transactionsA = pw.io.csv.read(\n",
        "    \"./transactionsA.csv\",\n",
        "    schema=TransactionsA,\n",
        "    mode=\"static\",\n",
        ")\n",
        "transactionsB = pw.io.csv.read(\n",
        "    \"./transactionsB.csv\",\n",
        "    schema=TransactionsB,\n",
        "    mode=\"static\",\n",
        ")\n",
        "pw.debug.compute_and_print(transactionsA)\n",
        "pw.debug.compute_and_print(transactionsB)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9",
      "metadata": {
        "lines_to_next_cell": 2
      },
      "source": [
        "Then we use our fuzzy join functions to do the reconciliation between the two tables."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "10",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | left        | right       | confidence\n",
            "^KTYMV6X... | ^KTYMV6X... | ^3Q58BEG... | 1.0\n",
            "^XG26N77... | ^XG26N77... | ^D15GKVR... | 1.3333333333333333\n",
            "^HXTV5C4... | ^HXTV5C4... | ^66ZMGEV... | 1.5\n",
            "^GNS8X6R... | ^GNS8X6R... | ^Y9ZHJG0... | 0.8333333333333333\n",
            "^KSJ0J11... | ^KSJ0J11... | ^B85GHJN... | 1.0\n"
          ]
        }
      ],
      "source": [
        "def match_transactions(transactionsA, transactionsB):\n",
        "    matching = pw.ml.smart_table_ops.fuzzy_match_tables(transactionsA, transactionsB)\n",
        "    transactionsA_reconciled = (\n",
        "        pw.Table.empty(left=pw.Pointer, right=pw.Pointer, confidence=float)\n",
        "        .update_rows(transactionsA.select(left=None, right=None, confidence=0.0))\n",
        "        .update_rows(\n",
        "            matching.select(\n",
        "                pw.this.left, pw.this.right, confidence=pw.this.weight\n",
        "            ).with_id(pw.this.left)\n",
        "        )\n",
        "    )\n",
        "    return transactionsA_reconciled\n",
        "\n",
        "\n",
        "pw.debug.compute_and_print(match_transactions(transactionsA, transactionsB))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "11",
      "metadata": {},
      "source": [
        "Success, all matchings were found!\n",
        "\n",
        "Super easy, few lines of code and you flawlessly manage datasets in different formats.\n",
        "Hassle-free.\n",
        "\n",
        "## Scaling with Pathway\n",
        "\n",
        "`smart_fuzzy_join` is able to handle much bigger datasets.\n",
        "Feel free to test it on your own data or use the full datasets from this tutorial,\n",
        "available [in this Google Spreadsheet](https://docs.google.com/spreadsheets/d/1cXAPcmkq0t0ieIQCBrdKPG2Fq_DimAzzxfHsDWrtdW0/edit?usp=sharing).\n",
        "\n",
        "<!-- It took TODO seconds to run the `match_formats` on the full datasets TODO rows each. -->\n",
        "\n",
        "In the tutorial we just printed a matching found on a small data sample. In a dynamic production environment:\n",
        "- All csv files will be dynamically ingested from these directories in order of appearance.\n",
        "- **The output will be updated immediately as new data appears at input.**\n",
        "\n",
        "## Conclusion and follow-up tasks\n",
        "\n",
        "While errors are human and we are unlikely to stop making some, we can free ourselves of the pain of correcting them each time something goes wrong.\n",
        "Sometimes, entries are harder to match and may require help: in that case you can check out our [extension](/developers/templates/fuzzy_join/fuzzy_join_chapter2) and see how we extend our pipeline with an auditor that supervises the process of reconciliation.\n",
        "From now on, you have no excuses for having mismatching logs: 'errare humanum est, perseverare diabolicum'!\n",
        "\n",
        "\n",
        "If you would like to get some more experience with Pathway, you can try those two challenges:\n",
        "\n",
        "**Challenge 1**\n",
        "\n",
        "Extend the `match_transactions` function so that, after finding a matching, it extends the first input table (standard csv format) with columns 'fees' and 'currency'.\n",
        "\n",
        "**Challenge 2**\n",
        "\n",
        "Try to augment the datasets so that they are still reasonable but `smart_fuzzy_join` fails to find all matchings \ud83d\ude09"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.11"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}